In our Capstone Project, we aim to explore the relationship between meteorological conditions (Air temperature and rainfall) of Singapore and clinical conditions commonly seen in local community – Dengue fever (DF), hand food mouth disease (HFMD), upper respiratory tract infection (URTI) and diarrhea. Specifically, does higher temperature and rainfall associated with the rise in clinical conditions in the community? Furthermore, we are interested to investigate if types of housing is associated with the number of Dengue fever cases.
Dengue fever is a vector-borne infectious disease that is endemic in the tropical world. Singapore is one of several countries with high disease burden of dengue. In 2020, Singapore saw 1,158 dengue cases in a week of June - the highest number of weekly dengue cases ever recorded since 2014. The sudden increase is worrisome, therefore greater emphasis on analysis will be focused on Dengue fever.
For our data science project, we activated the following packages, using the Tidyverse approach.
# Rule for activating packages in development mode:
# - Specify everything fully
# - Exceptions:
# - ggfortify - to overload `ggplot2::autoplot()`
# - ggmap - has to be activated for `ggmap::register_google()` to work
# - ggplot2 - some function arguments require `ggplot2::`, it's crazy
# - magrittr - `%>%`
#
# For deployment mode, activate every package that is referenced more than once
# - Add package name here, then use find-and-replace to remove "<package>::"
# - Allowed to activate tidyverse
# - Remember what are the 8 packages?
pacman::p_load(
ggfortify,
# ggmap,
ggplot2,
magrittr
# tidyverse
)
Then, we imported our dataset.
Most time-consuming, finding data, is.
We work towards 2 datasets: data_time for temporal analysis, and data_space for spatial analysis. When they are ready, the import-tidy-transform process would have been completed (at least for the first round of transform, subsequent rounds of transforms might still be required depending on the exigencies informed by the modeling and visualization processes).
All data are available online and were gathered into 7 distinct raw datasets (click here for a summary page):
The full glorious details of their providence are given below.
The following functions must be activated no matter which method of data import is chosen.
import_moh_weekly <- function(url_or_path = NULL) {
#' Weekly Infectious Diseases Bulletin
#'
#' @description
#' \href{https://www.moh.gov.sg/resources-statistics/infectious-disease-
#' statistics/2020/weekly-infectious-diseases-bulletin}{MOH Weekly Infectious
#' Disease Bulletin} from the Ministry of Health (MOH).
#'
#' \href{https://www.moh.gov.sg/docs/librariesprovider5/diseases-updates/
#' weekly-infectious-disease-bulletin-year-2020
#' d1092fcb484447bc96ef1722b16b0c08.xlsx}{Latest data as of 31 July 2020
#' (2012-W01 to 2020-W30)}.
#'
#' @param url_or_path The URL or file path of the .xlsx file.
#' @return Weekly infectious diseases bulletin (2012-W01 to 2020-W30).
# If no URL or path is specified, try to get the file directly from MOH.
if (is.null(url_or_path)) {
url_or_path = paste0(
"https://www.moh.gov.sg/docs/librariesprovider5/diseases-updates/",
"weekly-infectious-disease-bulletin-year-2020",
"d1092fcb484447bc96ef1722b16b0c08.xlsx"
)
}
# Check if the given path is a URL by trying to download to a temp file. If
# successful, return the temp file. If not, return the original path.
xlsx_file = tryCatch({
temp = tempfile(fileext = ".xlsx")
download.file(url_or_path, destfile = temp, mode = "wb")
temp
}, error = function(e) {
url_or_path
})
# Columns will be renamed to follow 2020
colnames_2020 = c(
"Campylobacter enterosis" = "Campylobacter enteritis",
"Campylobacterenterosis" = "Campylobacter enteritis",
"Campylobacteriosis" = "Campylobacter enteritis",
"Chikungunya Fever" = "Chikungunya",
"Dengue Haemorrhagic Fever" = "DHF",
"Dengue Fever" = "Dengue",
"Hand, Foot and Mouth Disease" = "HFMD",
"Hand, Foot Mouth Disease" = "HFMD",
"Nipah virus infection" = "Nipah",
"Viral Hepatitis A" = "Acute Viral Hepatitis A",
"Viral Hepatitis E" = "Acute Viral Hepatitis E",
"Zika Virus Infection" = "Zika",
"Zika virus infection" = "Zika"
)
xlsx_file %>%
readxl::excel_sheets() %>%
lapply(function(sheetname) {
df = readxl::read_xlsx(xlsx_file, sheetname, skip = 1)
# Date formats are different for 2020 (dmy instead of mdy)
if (sheetname == "2020") {
df$Start = lubridate::dmy(df$Start)
df$End = lubridate::dmy(df$End)
}
# Find and rename columns that need to be renamed, and rename them
mapper = na.omit(colnames_2020[names(df)])
dplyr::rename_with(df, ~mapper, names(mapper))
}) %>%
dplyr::bind_rows() %>%
dplyr::rename(Epiweek = `Epidemiology Wk`) %>%
dplyr::mutate(Epiyear = lubridate::epiyear(Start)) %>%
dplyr::select(Epiyear, everything()) %>%
dplyr::arrange(Start)
}
read_kmls <- function(url_or_path) {
#' Read Single or Multiple KML Files
#'
#' @description
#' There are (at least) 2 approaches to handling .kml data:
#' \enumerate{
#' \item sp - rgdal::readOGR()
#' \item sf - sf::st_read()
#' }
#'
#' The sp approach was abandoned as their objects were rather complex and
#' did not facilitate method chaining.
#'
#' The sf approach produced objects that look like data.frames, which had
#' better support for method chaining, but also some peculiarities:
#' \itemize{
#' \item Dimension: set to XY using sf::st_zm()
#' \item Geographic CRS: use sf::`st_crs<-`("WGS84") World Geodetic
#' Survey 1984
#' }
#'
#' @param url_or_path The URL(s) or file path(s) of the .kml file(s).
#' @return A single combined sf object.
# Check if the given paths are URLs by trying to download to temp files. If
# successful, return the temp files. If not, return the original paths.
kml_files = tryCatch({
temp = tempfile(fileext = rep(".kml", length(url_or_path)))
Map(function(x, y) download.file(x, y, mode = "wb"), url_or_path, temp)
temp
}, error = function(e) {
url_or_path
})
kml_files %>%
lapply(sf::st_read, quiet = T) %>%
dplyr::bind_rows() %>%
tibble::as_tibble() %>%
sf::st_as_sf() %>%
sf::st_zm()
}
The following code chunk has been disabled, but is provided here to show the details behind the webscraping processes.
import_mss_daily <- function(years, stations = NULL) {
#' Historical Daily Weather Records
#'
#' @description
#' \href{http://www.weather.gov.sg/climate-historical-daily/}{Daily weather
#' records} from the Meteorological Service Singapore (MSS).
#' Available data ranges from January 1980 to June 2020. Only 19 of the 63
#' climate stations are recognized by this function, because these contain
#' more than just rainfall data. \href{http://www.weather.gov.sg/wp-content/
#' uploads/2016/12/Station_Records.pdf}{List of stations, weather parameters
#' and periods of records}.
#'
#' MSS is nice enough to have their data in .csv files, with a systematic
#' naming scheme. Data compilation becomes as simple as generating the
#' appropriate list of URLs.
#'
#' @param years A vector of the years of interest.
#' @param stations A vector of the climate station names.
#' @return Combined daily weather records.
#' @examples
#' import_mss_daily(2012:2020, "Changi")
#' import_mss_daily(2012:2020, c("Changi", "Clementi", "Khatib", "Newton"))
stations_lookup = c(
"Admiralty" = "104_",
"Ang Mo Kio" = "109_",
"Changi" = "24_",
"Choa Chu Kang (South)" = "121_",
"Clementi" = "50_",
"East Coast Parkway" = "107_",
"Jurong (West)" = "44_",
"Jurong Island" = "117_",
"Khatib" = "122_",
"Marina Barrage" = "108_",
"Newton" = "111_",
"Pasir Panjang" = "116_",
"Pulau Ubin" = "106_",
"Seletar" = "25_",
"Sembawang" = "80_",
"Sentosa Island" = "60_",
"Tai Seng" = "43_",
"Tengah" = "23_",
"Tuas South" = "115_"
)
# Check that all provided station names are in the list, if not, exit and
# print out the list (of names) for the user.
mask = !(stations %in% names(stations_lookup))
if (any(mask)) {
stop("The following station names are not recognized:\n",
paste(stations[mask], collapse = "\n"),
"\n\nPlease select from the following:\n",
paste(names(stations_lookup), collapse = "\n"))
}
# If no station names specified, take the full list
if (is.null(stations)) {
station_nums = stations_lookup
} else {
station_nums = stations_lookup[stations]
}
# The URL to each .csv file has the following format:
# - "<url_base><station number>_<YYYY><MM>.csv"
# - Notice the station numbers given above contain the "_" suffix
url_base = "http://www.weather.gov.sg/files/dailydata/DAILYDATA_S"
# To enumerate all the URLs, take the Cartesian product of:
# - station numbers
# - years
# - months (we'll always attempt to find all 12 months)
#
# Flatten the result into a vector, then prefix and suffix with `url_base`
# and ".csv", respectively.
# Base R
urls = station_nums %>%
outer(years, FUN = paste0) %>%
outer(sprintf("%02d", 1:12), FUN = paste0) %>%
sort() %>% # Flatten and arrange in alphabetical order
paste0(url_base, ., ".csv")
# Equivalent tidyverse approach (for reference)
# urls = station_nums %>%
# tidyr::crossing(years, sprintf("%02d", 1:12)) %>%
# tidyr::unite("station_year_month", sep = "") %>%
# dplyr::pull() %>%
# paste0(url_base, ., ".csv")
# We have a list of URLs, but some of them may not exist (some stations do
# not have data for some months). Use tryCatch() to return a table for the
# URLs that exist, and a NA for those that don't.
#
# Problems with multibyte characters and countermeasures:
# - Some colnames contained the "degree sign" symbol which somehow made the
# table contents inaccessible. Tables after April 2020 had slightly
# different colnames.
# - Manually set the colnames for each table.
# - Some tables contained the "em dash" symbol to indicate missing values.
# - They will be coerced to NA when the columns are type cast to numeric.
# - There will be warning messages.
dfs = urls %>%
lapply(function(url) {
tryCatch({
df = readr::read_csv(url,
skip = 1,
col_names = c(
"Station",
"Year",
"Month",
"Day",
"Rainfall",
"Highest_30_min_rainfall",
"Highest_60_min_rainfall",
"Highest_120_min_rainfall",
"Mean_temp",
"Max_temp",
"Min_temp",
"Mean_wind",
"Max_wind"
),
col_types = readr::cols_only(
"Station" = "c",
"Year" = "n",
"Month" = "n",
"Day" = "n",
"Rainfall" = "n",
"Mean_temp" = "n",
"Max_temp" = "n",
"Min_temp" = "n"
))
# Announce progress (UX is important! We can tolerate lower efficiency)
message(paste(df[1, 1:3], collapse = "_"))
df
}, error = function(e) {
NA
})
})
dfs[!is.na(dfs)] %>%
dplyr::bind_rows() %>%
# Calculate daily temperature range
dplyr::mutate(Temp_range = Max_temp - Min_temp,
.keep = "unused") %>%
# Calculate epidemiological years and weeks
dplyr::mutate(Date = lubridate::ymd(paste(Year, Month, Day, sep = "-")),
Epiyear = lubridate::epiyear(Date),
Epiweek = lubridate::epiweek(Date),
.keep = "unused",
.after = Station) %>%
dplyr::arrange(Station, Date)
}
import_hcidirectory <- function() {
#' Healthcare Institutions Directory
#'
#' @description
#' The \href{http://hcidirectory.sg/hcidirectory/}{Healthcare Institutions
#' (HCI) Directory}, an initiative by the Ministry of Health (MOH), is a
#' platform for all HCIs licensed under the Private Hospitals and Medical
#' Clinics (PHMC) Act to provide information about their services and
#' operations to the public.
#'
#' This function is custom-made to consolidate the names and addresses of
#' HCIs which are medical clinics that offer general medical services.
#'
#' The HCI Directory is a dynamic web page, so using RSelenium might be
#' required.
#'
#' @return The names and addresses of selected HCIs.
# Run a Selenium Server using `RSelenium::rsDriver()`. The parameters e.g.
# `browser`, `chromever` (or `geckover` if using Firefox, or other drivers
# if using other browsers) have to be properly set. Trial-and-error until a
# configuration works. Set `check = T` the very first time it's run on a
# system, then set `check = F` after that to speed things up.
rD = RSelenium::rsDriver(browser = "chrome",
chromever = "83.0.4103.39",
check = F)
# Connect to server with a remoteDriver instance.
remDr = rD$client
# Set timeout on waiting for elements
remDr$setTimeout(type = "implicit", milliseconds = 10000)
# Navigate to the given URL
remDr$navigate("http://hcidirectory.sg/hcidirectory/")
# Click 4 things:
# 1. "MORE SEARCH OPTIONS"
# 2. "Medical Clinics Only"
# 3. "General Medical"
# 4. "Search"
c(
"options" = "#moreSearchOptions",
"medclins" = "#criteria > table > tbody > tr:nth-child(2) > td > label",
"genmed" = "#isGenMed",
"search" = "#search_btn_left"
) %>%
lapply(remDr$findElement, using = "css") %>%
purrr::walk(function(elem) elem$clickElement())
# Find the number of pages
results = remDr$findElement("#results", using = "css")
npages = results$getElementAttribute("innerHTML")[[1]] %>%
xml2::read_html() %>%
rvest::html_node("#totalPage") %>%
rvest::html_attr("value") %>%
as.numeric()
# Create an empty tibble to append results
df = tibble::tibble(
id = character(),
name = character(),
add = character()
)
i = 1
while (T) {
results = remDr$findElement("#results", using = "css")
html = results$getElementAttribute("innerHTML")[[1]] %>%
xml2::read_html()
# Determine the index numbers of the (up to 10) results on the page
idx = html %>%
# Find the element that says "SHOWING 1 - 10 OF 1,761 RESULTS"
rvest::html_nodes(".col1") %>%
.[1] %>%
rvest::html_text() %>%
# Commas have to be eliminated for numbers > 999
gsub(",", "", .) %>%
# Find the smallest and largest numbers and apply the colon operator
sub(".*Showing\\s+(.*)\\s+of.*", "\\1", .) %>%
strsplit(split = " - ") %>%
unlist() %>%
as.numeric() %>%
{ .[1]:.[2] }
# Only append results if IDs are not in the table
if (!any(idx %in% df$id)) {
df = df %>%
dplyr::bind_rows(
html %>%
# Find both the name and address nodes
rvest::html_nodes(".name,.add") %>%
rvest::html_text() %>%
# Tidy whitespace
gsub("\\s+", " ", .) %>%
trimws() %>%
# Concatenate IDs, odd rows (names), and even rows (addresses)
{ cbind(idx, .[c(TRUE,FALSE)], .[!c(TRUE,FALSE)]) } %>%
tibble::as_tibble() %>%
setNames(c("id", "name", "add"))
)
# Announce progress and increment page counter
message(i, " of ", npages, " done (", round(i / npages * 100, 2), "%)")
i = i + 1
}
# Natural exit point
if (i > npages) break
# Navigate to the next page (if available, else stop)
the_end = tryCatch({
nextpage = remDr$findElement("#PageControl > div.r_arrow", using = "css")
nextpage$clickElement()
F
}, error = function(e) {
print(paste("There are no more pages after", i))
T
})
# Unnatural exit point
if (the_end) break
}
# Clean up RSelenium
remDr$close()
rD[["server"]]$stop()
rm(rD, remDr)
gc()
# Kill Java instance(s) inside RStudio
# docs.microsoft.com/en-us/windows-server/administration/windows-commands/taskkill
system("taskkill /im java.exe /f", intern = F, ignore.stdout = F)
# Clean up:
# - Franchises may have the same name with different addresses
# - Different practices may have the same zipcodes and even buildings
# - We will consider each full address unique, and a single practice
# Clean up duplicate addresses
df %>%
.[!duplicated(tolower(.$add), fromLast = T),]
}
zipcodes_to_geocodes <- function(zipcodes) {
#' Get Geo-location from Google Maps
#'
#' @description
#' Attempt to obtain the longitudes, latitudes, and addresses of the given
#' zipcodes using ggmap::geocode().
#'
#' @param zipcodes A vector of zipcodes.
#' @return Geo-location data of the associated zipcodes.
# Prompt user to input API key
ggmap::register_google(key = readline("Please enter Google API key: "))
# Create an (almost) empty tibble to append results
res = zipcodes %>%
# Remove duplicates to minimize number of requests
.[!duplicated(.)] %>%
tibble::as_tibble() %>%
dplyr::rename(zip = value) %>%
dplyr::mutate(lon = NA_real_,
lat = NA_real_,
address = NA_character_)
for (i in 1:nrow(res)) {
result = tryCatch({
ggmap::geocode(res$zip[i], output = "latlona", source = "google")
}, warning = function(w) {
w$message
}, error = function(e) {
NA
})
# If the registered key is invalid, there's no point continuing
if (grepl("The provided API key is invalid", result[1], fixed = T)) {
stop("A valid Google API key is required.")
}
# A useful result will have something, and will have names
if (!is.na(result) && !is.null(names(result))) {
res$lon[i] = result$lon
res$lat[i] = result$lat
res$address[i] = result$address
}
# Announce progress
message(i, " of ", nrow(res), " (",round(i / nrow(res) * 100, 2), "%)")
}
res
}
The following code chunk has been disabled, but is provided here to show how the data may be obtained from primary sources (with one exception). Using this code chunk would require activating the previous code chunk, as well as a valid Google Maps Platform API key.
grand_import_de_novo <- function() {
# This might take a while, and requires a Google Maps Platform API key
list(
"moh_bulletin" = import_moh_weekly(),
"mss_19stations" = import_mss_daily(years = 2012:2020),
"hci_clinics" = import_hcidirectory() %>%
dplyr::mutate(zip = sub(".*((?i)singapore\\s+\\d+).*", "\\1", add)) %>%
# Requires Google Maps Platform API key
dplyr::left_join(zipcodes_to_geocodes(.$zip), by = "zip") %>%
dplyr::select(-id, -zip, -address) %>%
tidyr::drop_na(),
"planning_areas" = read_kmls(paste0(
"https://geo.data.gov.sg/plan-bdy-dwelling-type-2017/2017/09/27/kml/",
"plan-bdy-dwelling-type-2017.kml"
)),
"dengue_polys" = read_kmls(
c("central", "northeast", "southeast", "southwest") %>%
paste0("https://geo.data.gov.sg/denguecase-", .,
"-area/2020/07/17/kml/denguecase-", ., "-area.kml")
),
"aedes_polys" = read_kmls(c(
c("central", "northeast", "northwest") %>%
paste0("https://geo.data.gov.sg/breedinghabitat-", .,
"-area/2020/07/17/kml/breedinghabitat-", ., "-area.kml"),
c("southeast", "southwest") %>%
paste0("https://geo.data.gov.sg/breedinghabitat-", .,
"-area/2020/07/23/kml/breedinghabitat-", ., "-area.kml")
)),
"mss_63station_pos" = readr::read_csv(paste0(
"https://raw.githubusercontent.com/roscoelai/dasr2020capstone/master/",
"data/mss/Station_Records.csv"
))
)
}
# raw_data <- grand_import_de_novo()
If the data files are available locally, set from_online_repo = F for faster loading. Otherwise, the file will be obtained from an online repository. The .csv files will be read directly while the other file formats will be downloaded to a temporary folder and read locally.
grand_import_no_webscraping <- function(from_online_repo = TRUE) {
# Allow user to choose whether to import raw data from an online repository
# or from local files.
if (from_online_repo) {
fld = paste0("https://raw.githubusercontent.com/roscoelai/",
"dasr2020capstone/master/data/")
} else {
# Check that the "../data/ folder exists
assertthat::assert_that(dir.exists("../data/"),
msg = 'Unable to locate "../data/" directory.')
fld = "../data/"
}
list(
"moh_bulletin" = import_moh_weekly(paste0(
fld, "moh/weekly-infectious-disease-bulletin-year-2020.xlsx"
)),
"mss_19stations" = readr::read_csv(paste0(
fld, "mss/mss_daily_2012_2020_19stations_20200728.csv"
)),
"hci_clinics" = readr::read_csv(paste0(
fld, "hcid/hci_clinics_20200725.csv"
)),
"planning_areas" = read_kmls(paste0(
fld, "kmls/plan-bdy-dwelling-type-2017.kml"
)),
"dengue_polys" = read_kmls(paste0(
fld, "kmls/denguecase-", c("central",
"northeast",
"southeast",
"southwest"), "-area.kml"
)),
"aedes_polys" = read_kmls(paste0(
fld, "kmls/breedinghabitat-", c("central",
"northeast",
"northwest",
"southeast",
"southwest"), "-area.kml"
)),
"mss_63station_pos" = readr::read_csv(paste0(
fld, "mss/Station_Records.csv"
))
)
}
raw_data <- grand_import_no_webscraping(from_online_repo = F)
raw_data %T>% {
print(tibble::tibble(names = names(.),
data = .,
nrow = sapply(., nrow),
ncol = sapply(., ncol)))
}
# A tibble: 7 x 4
names data nrow ncol
<chr> <named list> <int> <int>
1 moh_bulletin <tibble [470 x 46]> 470 46
2 mss_19stations <tibble [58,489 x 7]> 58489 7
3 hci_clinics <tibble [1,739 x 4]> 1739 4
4 planning_areas <tibble [55 x 3]> 55 3
5 dengue_polys <tibble [1,203 x 3]> 1203 3
6 aedes_polys <tibble [315 x 3]> 315 3
7 mss_63station_pos <tibble [63 x 9]> 63 9
$moh_bulletin
# A tibble: 470 x 46
Epiyear Epiweek Start End Cholera Paratyphoid
<dbl> <dbl> <dttm> <dttm> <dbl> <dbl>
1 2012 1 2012-01-01 00:00:00 2012-01-07 00:00:00 0 1
2 2012 2 2012-01-08 00:00:00 2012-01-14 00:00:00 0 1
3 2012 3 2012-01-15 00:00:00 2012-01-21 00:00:00 0 1
4 2012 4 2012-01-22 00:00:00 2012-01-28 00:00:00 0 1
5 2012 5 2012-01-29 00:00:00 2012-02-04 00:00:00 0 3
6 2012 6 2012-02-05 00:00:00 2012-02-11 00:00:00 0 2
7 2012 7 2012-02-12 00:00:00 2012-02-18 00:00:00 0 1
8 2012 8 2012-02-19 00:00:00 2012-02-25 00:00:00 0 4
9 2012 9 2012-02-26 00:00:00 2012-03-03 00:00:00 0 4
10 2012 10 2012-03-04 00:00:00 2012-03-10 00:00:00 0 2
# ... with 460 more rows, and 40 more variables: Typhoid <dbl>, `Acute Viral
# Hepatitis A` <dbl>, `Acute Viral Hepatitis E` <dbl>, Poliomyelitis <dbl>,
# Plague <dbl>, `Yellow Fever` <dbl>, Dengue <dbl>, DHF <dbl>, Malaria <dbl>,
# Chikungunya <dbl>, HFMD <dbl>, Diphtheria <dbl>, Measles <dbl>,
# Mumps <dbl>, Rubella <dbl>, SARS <dbl>, Nipah <dbl>, `Acute Viral hepatitis
# B` <dbl>, Encephalitis <dbl>, Legionellosis <dbl>, `Campylobacter
# enteritis` <dbl>, `Acute Viral hepatitis C` <dbl>, Leptospirosis <dbl>,
# Melioidosis <dbl>, `Meningococcal Infection` <dbl>, Pertussis <dbl>,
# `Pneumococcal Disease (invasive)` <dbl>, `Haemophilus influenzae type
# b` <dbl>, `Salmonellosis(non-enteric fevers)` <dbl>, `Avian
# Influenza` <dbl>, Zika <dbl>, `Ebola Virus Disease` <dbl>, `Japanese
# Encephalitis` <dbl>, Tetanus <dbl>, Botulism <dbl>, `Murine Typhus` <dbl>,
# `Acute Upper Respiratory Tract infections` <dbl>, `Acute
# Conjunctivitis` <dbl>, `Acute Diarrhoea` <dbl>, Chickenpox <dbl>
$mss_19stations
# A tibble: 58,489 x 7
Station Date Epiyear Epiweek Rainfall Mean_temp Temp_range
<chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Admiralty 2012-01-01 2012 1 0 27.3 7.40
2 Admiralty 2012-01-02 2012 1 0 27.1 5.8
3 Admiralty 2012-01-03 2012 1 0 26.9 4.60
4 Admiralty 2012-01-04 2012 1 0 26.8 7
5 Admiralty 2012-01-05 2012 1 0 26.4 6.5
6 Admiralty 2012-01-06 2012 1 0 27.1 6.40
7 Admiralty 2012-01-07 2012 1 1.4 26.5 6.20
8 Admiralty 2012-01-08 2012 2 3.4 24.8 2.40
9 Admiralty 2012-01-09 2012 2 2.6 25.4 3.4
10 Admiralty 2012-01-10 2012 2 8.6 24.7 2.80
# ... with 58,479 more rows
$hci_clinics
# A tibble: 1,739 x 4
name add lon lat
<chr> <chr> <dbl> <dbl>
1 DA CLINIC @ BISHAN BLK 501 BISHAN STREET 11 #01-374 Singapor~ 104. 1.35
2 SINGHEALTH POLYCLINIC~ 1 TAMPINES STREET 41 TAMPINES POLYCLINIC ~ 104. 1.36
3 1 MEDICAL TECK GHEE BLK 410 ANG MO KIO AVENUE 10 TECK GHEE SQ~ 104. 1.36
4 115 EASTPOINT CLINIC ~ BLK 115 BEDOK NORTH RD #01-301 Singapore ~ 104. 1.33
5 18 CLINIC BLK 101 TOWNER ROAD #01-228 Singapore 322~ 104. 1.32
6 1AESTHETICS, MEDICAL ~ 8 EU TONG SEN STREET THE CENTRAL #14-90 S~ 104. 1.29
7 326 AVENUE 3 CLINIC BLK 326 SERANGOON AVE 3 #01-382 Singapore~ 104. 1.35
8 338 FAMILY CLINIC BLK 338 ANG MO KIO AVE 1 #01-1615 Singapo~ 104. 1.36
9 57 MEDICAL CLINIC (GE~ BLK 57 GEYLANG BAHRU #01-3505 Singapore 3~ 104. 1.32
10 8 MEDICAL AESTHETIC C~ 51 CUPPAGE ROAD #01-03 Singapore 229469 104. 1.30
# ... with 1,729 more rows
$planning_areas
Simple feature collection with 55 features and 2 fields
geometry type: MULTIPOLYGON
dimension: XY
bbox: xmin: 103.6057 ymin: 1.158699 xmax: 104.0885 ymax: 1.470775
geographic CRS: WGS 84
# A tibble: 55 x 3
Name Description geometry
<chr> <chr> <MULTIPOLYGON [°]>
1 kml_1 "<center><table><tr><th colsp~ (((103.8572 1.396537, 103.8574 1.39629~
2 kml_2 "<center><table><tr><th colsp~ (((103.9319 1.343094, 103.9355 1.33956~
3 kml_3 "<center><table><tr><th colsp~ (((103.8492 1.362753, 103.8494 1.36268~
4 kml_4 "<center><table><tr><th colsp~ (((103.6973 1.307544, 103.6973 1.30755~
5 kml_5 "<center><table><tr><th colsp~ (((103.7641 1.370011, 103.7644 1.36947~
6 kml_6 "<center><table><tr><th colsp~ (((103.8172 1.294353, 103.8174 1.29433~
7 kml_7 "<center><table><tr><th colsp~ (((103.7745 1.390289, 103.775 1.386071~
8 kml_8 "<center><table><tr><th colsp~ (((103.7977 1.348128, 103.7981 1.34778~
9 kml_9 "<center><table><tr><th colsp~ (((103.9018 1.309745, 103.9015 1.30954~
10 kml_10 "<center><table><tr><th colsp~ (((103.9081 1.309816, 103.9082 1.30910~
# ... with 45 more rows
$dengue_polys
Simple feature collection with 1203 features and 2 fields
geometry type: POLYGON
dimension: XY
bbox: xmin: 103.6848 ymin: 1.269624 xmax: 103.967 ymax: 1.414322
geographic CRS: WGS 84
# A tibble: 1,203 x 3
Name Description geometry
<chr> <chr> <POLYGON [°]>
1 kml_1 "<center><table><tr><th colsp~ ((103.843 1.322077, 103.843 1.323886, ~
2 kml_2 "<center><table><tr><th colsp~ ((103.8484 1.327504, 103.8484 1.329312~
3 kml_3 "<center><table><tr><th colsp~ ((103.8681 1.327503, 103.8681 1.329312~
4 kml_4 "<center><table><tr><th colsp~ ((103.843 1.3474, 103.843 1.349208, 10~
5 kml_5 "<center><table><tr><th colsp~ ((103.8448 1.3474, 103.8448 1.349208, ~
6 kml_6 "<center><table><tr><th colsp~ ((103.8591 1.322077, 103.8591 1.323886~
7 kml_7 "<center><table><tr><th colsp~ ((103.8699 1.322077, 103.8699 1.323886~
8 kml_8 "<center><table><tr><th colsp~ ((103.8322 1.320269, 103.8322 1.322077~
9 kml_9 "<center><table><tr><th colsp~ ((103.834 1.320269, 103.834 1.322077, ~
10 kml_10 "<center><table><tr><th colsp~ ((103.8555 1.320269, 103.8555 1.322077~
# ... with 1,193 more rows
$aedes_polys
Simple feature collection with 315 features and 2 fields
geometry type: POLYGON
dimension: XY
bbox: xmin: 103.6848 ymin: 1.27505 xmax: 103.9652 ymax: 1.446879
geographic CRS: WGS 84
# A tibble: 315 x 3
Name Description geometry
<chr> <chr> <POLYGON [°]>
1 kml_1 "<center><table><tr><th colsp~ ((103.8537 1.370913, 103.8537 1.372722~
2 kml_2 "<center><table><tr><th colsp~ ((103.8448 1.367296, 103.8448 1.369104~
3 kml_3 "<center><table><tr><th colsp~ ((103.8358 1.378148, 103.8358 1.379957~
4 kml_4 "<center><table><tr><th colsp~ ((103.8861 1.378147, 103.8861 1.379956~
5 kml_5 "<center><table><tr><th colsp~ ((103.8537 1.365487, 103.8537 1.367296~
6 kml_6 "<center><table><tr><th colsp~ ((103.8466 1.370913, 103.8466 1.372722~
7 kml_7 "<center><table><tr><th colsp~ ((103.8717 1.37453, 103.8717 1.376339,~
8 kml_8 "<center><table><tr><th colsp~ ((103.8843 1.376339, 103.8843 1.378147~
9 kml_9 "<center><table><tr><th colsp~ ((103.8897 1.390808, 103.8897 1.392617~
10 kml_10 "<center><table><tr><th colsp~ ((103.8286 1.314842, 103.8286 1.316651~
# ... with 305 more rows
$mss_63station_pos
# A tibble: 63 x 9
Station `Lat.(N)` `Long. (E)` `Period of Dail~ `Period of 30,6~
<chr> <dbl> <dbl> <chr> <chr>
1 Paya L~ 1.35 104. Jan 1980-current <NA>
2 Macrit~ 1.34 104. Jan 1980-current Jan 2014-current
3 Lower ~ 1.37 104. Jan 2010-current Jan 2014-current
4 Tengah 1.39 104. Jan 1980-current <NA>
5 Changi 1.37 104. Jan 1981-current Jan 2014-current
6 Seletar 1.42 104. Jan 1980-current <NA>
7 Pasir ~ 1.39 104. Jan 1980-current Jan 2014-current
8 Kampon~ 1.27 104. Jan 1980-current Jan 2014-Oct 20~
9 Jurong~ 1.31 104. Jan 1980-current Jan 2014-current
10 Ulu Pa~ 1.33 104. Jan 1980-current Jan 2014-current
# ... with 53 more rows, and 4 more variables: `Period of Mean
# Temperature` <chr>, `Period of Max and Min Temperature` <chr>, `Period of
# Mean Wind Speed` <chr>, `Period of Max Wind Speed` <chr>
Behold! Our 7 raw datasets, nicely packaged in the list raw_data. No more importing should be necessary from this point.
data_timeThe unit of analysis for data_time is the epidemiological week.
The data from the MOH bulletin is already organized this way. The only things left to do are to select the relevant columns and possibly provide “easier” column names.
The meteorological data time period (2012-2020) matches the MOH bulletin, more or less. However, it’s units are in days. Furthermore, the dates are duplicated for 19 climate stations. To match the MOH bulletin’s national-level data, the meteorological data must be aggregated to the level of the epidemiological week (and year).
How many values are we aggregating for each variable, for each week?
raw_data$mss_19stations %>%
tidyr::pivot_longer(Rainfall:Temp_range) %>%
tidyr::drop_na() %>%
dplyr::count(Epiyear, Epiweek, name) %>%
dplyr::select(name, n) %T>%
{ print(table(.)) } %>%
ggplot(aes(x = n, color = name)) +
geom_histogram(binwidth = 1) +
facet_grid(name ~ ., scales = "free") +
labs(x = "Number of values", y = "Number of weeks") +
theme(legend.position = "none")
n
name 46 51 56 71 73 81 86 89 90 92 94 95 96 97 98 99
Mean_temp 1 0 0 1 0 2 1 1 3 2 2 3 2 2 4 5
Rainfall 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
Temp_range 0 1 0 0 1 1 0 0 1 0 0 2 1 0 1 4
n
name 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115
Mean_temp 4 4 3 7 7 5 12 10 14 26 16 16 18 5 4 1
Rainfall 0 0 1 0 0 1 1 0 1 0 1 1 1 1 1 0
Temp_range 0 2 0 2 1 0 2 1 2 8 4 5 5 4 8 4
n
name 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131
Mean_temp 2 8 4 6 1 1 5 6 13 9 66 2 10 11 16 23
Rainfall 2 4 7 7 6 2 3 4 11 9 70 8 15 11 23 19
Temp_range 4 14 11 20 16 19 19 15 25 7 51 3 2 4 5 7
n
name 132 133
Mean_temp 25 55
Rainfall 38 195
Temp_range 10 152
Now we join the datasets to get data_time.
data_time <- raw_data$moh_bulletin %>%
dplyr::select(Epiyear:End,
Dengue,
HFMD,
`Acute Upper Respiratory Tract infections`,
`Acute Diarrhoea`) %>%
dplyr::rename(URTI = `Acute Upper Respiratory Tract infections`,
Diarrhoea = `Acute Diarrhoea`) %>%
dplyr::left_join(
raw_data$mss_19stations %>%
dplyr::group_by(Epiyear, Epiweek) %>%
dplyr::summarise(Mean_rainfall = mean(Rainfall, na.rm = T),
Med_rainfall = median(Rainfall, na.rm = T),
Mean_temp = mean(Mean_temp, na.rm = T),
Med_temp = median(Mean_temp, na.rm = T),
Mean_temp_rng = mean(Temp_range, na.rm = T),
Med_temp_rng = median(Temp_range, na.rm = T),
.groups = "drop"),
by = c("Epiyear", "Epiweek")
) %>%
tidyr::drop_na()
# Data Cleaning
mask <- data_time$End < "2012-01-01" | data_time$End > "2020-07-31"
# Must personally see what's wrong
data_time[mask,]
# A tibble: 1 x 14
Epiyear Epiweek Start End Dengue HFMD URTI
<dbl> <dbl> <dttm> <dttm> <dbl> <dbl> <dbl>
1 2018 1 2017-12-31 00:00:00 1900-01-01 00:00:00 83 326 3158.
# ... with 7 more variables: Diarrhoea <dbl>, Mean_rainfall <dbl>,
# Med_rainfall <dbl>, Mean_temp <dbl>, Med_temp <dbl>, Mean_temp_rng <dbl>,
# Med_temp_rng <dbl>
# Correct it - thankfully the start date seems ok
data_time[mask, "End"] <- data_time$Start[mask] + lubridate::days(6)
# Then check that it's corrected
data_time[mask,]
# A tibble: 1 x 14
Epiyear Epiweek Start End Dengue HFMD URTI
<dbl> <dbl> <dttm> <dttm> <dbl> <dbl> <dbl>
1 2018 1 2017-12-31 00:00:00 2018-01-06 00:00:00 83 326 3158.
# ... with 7 more variables: Diarrhoea <dbl>, Mean_rainfall <dbl>,
# Med_rainfall <dbl>, Mean_temp <dbl>, Med_temp <dbl>, Mean_temp_rng <dbl>,
# Med_temp_rng <dbl>
Let’s see what inside data_time
dplyr::glimpse(data_time)
Rows: 444
Columns: 14
$ Epiyear <dbl> 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012,...
$ Epiweek <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16...
$ Start <dttm> 2012-01-01, 2012-01-08, 2012-01-15, 2012-01-22, 2012...
$ End <dttm> 2012-01-07, 2012-01-14, 2012-01-21, 2012-01-28, 2012...
$ Dengue <dbl> 74, 64, 60, 50, 84, 87, 65, 50, 55, 45, 64, 72, 48, 5...
$ HFMD <dbl> 326, 346, 463, 372, 444, 683, 810, 989, 1115, 1139, 1...
$ URTI <dbl> 2932.222, 3188.727, 3184.545, 4000.571, 3355.818, 335...
$ Diarrhoea <dbl> 490.8889, 575.0909, 538.9091, 614.5714, 558.5455, 556...
$ Mean_rainfall <dbl> 4.939850e-01, 4.386466e+00, 9.695489e+00, 4.429323e+0...
$ Med_rainfall <dbl> 0.0, 1.7, 0.2, 0.0, 0.0, 0.0, 1.2, 0.0, 1.4, 1.4, 0.3...
$ Mean_temp <dbl> 27.35469, 26.09847, 27.29297, 26.66406, 26.49699, 27....
$ Med_temp <dbl> 27.35469, 26.09847, 27.29297, 26.66406, 26.49699, 27....
$ Mean_temp_rng <dbl> 6.498496, 5.030827, 7.760150, 6.642105, 6.358647, 7.5...
$ Med_temp_rng <dbl> 6.4, 4.5, 7.7, 6.8, 6.4, 7.5, 6.5, 6.7, 6.8, 6.7, 6.5...
data_time contains 10 features: 4 indicators of disease numbers and 6 aggregated meteorological measures. Each value is associated with an epidemiological week. The time period spans from 2012-W01 to 2020-W30. As the meteorological records for July 2020 have not been published, there are no corresponding values for 2020-W28 and 2020-W29.
data_spaceA lot more wrangling is involved.
The unit of analysis for data_space is the planning area. This is specified by raw_data$planning_areas, and the other datasets would have to be transformed to match before merging.
Spatial analysis would be rather limited and more emphasis placed on visualization.
grand_transform_space <- function(raw_data) {
data = list(
"dengue_points" = raw_data$dengue_polys,
"aedes_points" = raw_data$aedes_polys
) %>%
lapply(function(df) {
df %>%
dplyr::transmute(n = sub(".*: (\\d+).*", "\\1", Description)) %>%
sf::st_centroid() %>%
.[rep(1:nrow(.), as.numeric(.$n)),]
})
data[["clinic_points"]] = raw_data$hci_clinics %>%
sf::st_as_sf(coords = c("lon", "lat")) %>%
sf::`st_crs<-`("WGS84")
data[["weather_points"]] = raw_data$mss_19stations %>%
dplyr::filter(Epiyear == 2020) %>%
# Filter for the last 3 weeks
dplyr::filter(Epiweek > max(Epiweek) - 3) %>%
# Aggregation schema (up to 7 days x up to 3 weeks -> 1 value)
dplyr::group_by(Station) %>%
dplyr::summarise(mean_rainfall = mean(Rainfall, na.rm = T),
med_rainfall = median(Rainfall, na.rm = T),
mean_temp = mean(Mean_temp, na.rm = T),
med_temp = median(Mean_temp, na.rm = T),
mean_temp_rng = mean(Temp_range, na.rm = T),
med_temp_rng = median(Temp_range, na.rm = T),
.groups = "drop") %>%
tidyr::drop_na() %>%
dplyr::left_join(
raw_data$mss_63station_pos %>%
dplyr::select(Station:`Long. (E)`),
by = "Station"
) %>%
sf::st_as_sf(coords = c("Long. (E)", "Lat.(N)")) %>%
sf::`st_crs<-`("WGS84")
# Because we have weather_points, we can define IDW interpolation and
# immediately join with planning_areas.
idw_interpolation <- function(points, polys, ordinal = 2) {
# Inverse-distance-weighted (IDW) interpolation
# - Very large ordinal: Proximity (Thiessen) interpolation
# - Large ordinal: Heavily weighted by inverse-distance
# - Moderate ordinal: Weighted by inverse-distance
# - Zero ordinal: Un-weighted average
# - Negative ordinal: Nonsense! (more weight to further stations)
# Matrix of inverse-distances raised to some power (npolys x npoints)
weights = polys %>%
sf::st_centroid() %>%
sf::st_distance(points) %>%
units::set_units(km) %>%
{ 1 / (. ^ ordinal) } # (55 x 16)
values = points %>%
as.data.frame() %>%
dplyr::select(-Station, -geometry) %>%
as.matrix() # (16 x 6)
# (55 x 16) * (16 x 6) => (55 x 6)
(weights %*% values / rowSums(weights)) %>%
tibble::as_tibble() %>%
dplyr::mutate(plan_area = polys$plan_area)
}
data[["planning_areas"]] = raw_data$planning_areas %>%
dplyr::bind_cols(
# Extract data from HTML in the Description column (dwelling types)
.$Description %>%
lapply(function(x) {
xml2::read_html(x) %>%
rvest::html_node("table") %>%
rvest::html_table() %>%
t() %>%
`colnames<-`(.[1,]) %>%
.[2, 1:10]
}) %>%
dplyr::bind_rows()
) %>%
dplyr::select(-Name, -Description) %>%
dplyr::rename_all(tolower) %>%
dplyr::rename(plan_area = pln_area_n,
pop = total) %>%
dplyr::mutate(plan_area = tools::toTitleCase(tolower(plan_area)),
dplyr::across(pop:others, as.numeric)) %>%
tibble::as_tibble() %>%
sf::st_as_sf() %>%
# Add meteorological data
dplyr::left_join(
idw_interpolation(data$weather_points, ., ordinal = 10),
by = "plan_area"
)
data
}
data <- grand_transform_space(raw_data)
data_space <- list(
ncases = data$dengue_points,
nhabs = data$aedes_points,
nclinics = data$clinic_points
) %>%
{
# Count the number of points in each polygon
lapply(names(.), function(name) {
.[[name]] %>%
sf::st_intersects(data$planning_areas) %>%
tibble::as_tibble() %>%
dplyr::mutate(plan_area = data$planning_areas$plan_area[col.id]) %>%
dplyr::count(plan_area, name = name)
})
} %>%
# Join everything
Reduce(function(x, y) dplyr::left_join(x, y, by = "plan_area"), .) %>%
dplyr::left_join(data$planning_areas, by = "plan_area") %>%
sf::st_as_sf() %>%
dplyr::select(-one_to_two_rm:-five_rm_exec_flats, -others) %>%
dplyr::mutate(
area_km2 = units::set_units(sf::st_area(.), km^2),
ncases_pht = ncases / pop * 100000,
nclinics_pht = nclinics / pop * 100000,
hdb_pp = hdb / pop,
condos_other_apts_pp = condos_other_apts / pop,
landed_properties_pp = landed_properties / pop,
popden = as.numeric(pop / area_km2),
habsden = as.numeric(nhabs / area_km2),
clinicsden = as.numeric(nclinics / area_km2),
caseden = as.numeric(ncases / area_km2),
label = paste0(
"<b>", plan_area,
"</b><br/>Area: ", round(area_km2, 2),
"<br/>Cases: ", ncases,
"<br/>Clinics: ", nclinics,
"<br/>Population: ", pop,
"<br/><i>Aedes</i> habitats: ", nhabs
)
)
dplyr::glimpse(data_space)
Rows: 29
Columns: 26
$ plan_area <chr> "Ang Mo Kio", "Bedok", "Bishan", "Bukit Batok"...
$ ncases <int> 74, 260, 61, 44, 36, 12, 18, 16, 347, 188, 7, ...
$ nhabs <int> 32, 72, 8, 11, 5, 5, 2, 2, 44, 25, 2, 6, 24, 6...
$ nclinics <int> 73, 101, 31, 52, 76, 32, 39, 44, 59, 74, 40, 7...
$ pop <dbl> 166820, 284930, 90280, 138290, 152790, 76380, ...
$ hdb <dbl> 137530, 185520, 62600, 105790, 140400, 7450, 1...
$ condos_other_apts <dbl> 11850, 52270, 16370, 25760, 10430, 35820, 1631...
$ landed_properties <dbl> 16220, 44630, 10630, 5730, 490, 32260, 2460, 5...
$ geometry <MULTIPOLYGON [°]> MULTIPOLYGON (((103.8572 1...., M...
$ mean_rainfall <dbl> 9.494118, 10.738648, 9.499668, 10.728192, 14.7...
$ med_rainfall <dbl> 2.0000000, 3.2393580, 2.0071371, 1.6073903, 1....
$ mean_temp <dbl> 27.98235, 28.30682, 27.98177, 27.69870, 28.215...
$ med_temp <dbl> 28.20000, 28.32052, 28.19905, 27.80008, 28.280...
$ mean_temp_rng <dbl> 5.229412, 4.115153, 5.228916, 5.450107, 4.9553...
$ med_temp_rng <dbl> 5.200000, 3.895949, 5.199374, 5.415854, 4.8043...
$ area_km2 [km^2] 13.9413792 [km^2], 21.7331261 [km^2], 7.61892...
$ ncases_pht <dbl> 44.359190, 91.250483, 67.567568, 31.817196, 23...
$ nclinics_pht <dbl> 43.75974, 35.44730, 34.33762, 37.60214, 49.741...
$ hdb_pp <dbl> 0.82442153, 0.65110729, 0.69339832, 0.76498662...
$ condos_other_apts_pp <dbl> 0.07103465, 0.18344857, 0.18132477, 0.18627522...
$ landed_properties_pp <dbl> 0.097230548, 0.156634963, 0.117744794, 0.04143...
$ popden <dbl> 11965.8176, 13110.4011, 11849.4478, 12421.3671...
$ habsden <dbl> 2.2953253, 3.3129150, 1.0500175, 0.9880327, 0....
$ clinicsden <dbl> 5.236211, 4.647284, 4.068818, 4.670700, 5.2549...
$ caseden <dbl> 5.3079397, 11.9633042, 8.0063836, 3.9521307, 2...
$ label <chr> "<b>Ang Mo Kio</b><br/>Area: 13.94<br/>Cases: ...
data_space contains 19 main features for each planning area (URA MP14), including: number of dengue cases, number of Aedes mosquito habitats, number of clinics, population, various dwelling types, area, and meteorological variables. Planning areas with no record of dengue cases were excluded, leaving 33 of the 55 planning areas (a large part due to unavailability of data for the north-west region).
# data_time %T>%
# dplyr::glimpse() %>%
# dplyr::select(-matches("Epiweek|End")) %>%
# dplyr::mutate(Epiyear = as.factor(Epiyear)) %>%
# tidyr::pivot_longer(Dengue:Med_temp_rng) %>%
# ggplot(aes(x = Start, y = value, color = Epiyear)) +
# geom_line() +
# geom_point(alpha = 0.3) +
# facet_grid(name ~ ., scales = "free_y") +
# labs(title = "Weekly history of variables from 2012 to 2020",
# x = "",
# y = "",
# caption = "Sources: moh.gov.sg, weather.gov.sg") +
# theme(legend.position = "none")
data_space %>%
dplyr::select(ncases:pop, mean_rainfall:med_temp_rng, geometry) %>%
gridExtra::grid.arrange(grobs = lapply(list(
ggplot(aes(fill = ncases), data = .),
ggplot(aes(fill = nhabs), data = .),
ggplot(aes(fill = nclinics), data = .),
ggplot(aes(fill = pop), data = .),
ggplot(aes(fill = mean_rainfall), data = .),
ggplot(aes(fill = med_rainfall), data = .),
ggplot(aes(fill = mean_temp), data = .),
ggplot(aes(fill = med_temp), data = .),
ggplot(aes(fill = mean_temp_rng), data = .),
ggplot(aes(fill = med_temp_rng), data = .)
), function(x) {
x +
geom_sf() +
scale_fill_viridis_c(guide = guide_colourbar(
title.position = "top",
title.hjust = .5,
barwidth = 10,
barheight = .5
)) +
theme_void() +
theme(legend.position = "bottom")
}), ncol = 2)
data_space %>%
dplyr::select(ncases:pop, area_km2, geometry) %>%
dplyr::mutate_at(dplyr::vars(-geometry), ~(. / area_km2)) %>%
dplyr::select(-area_km2) %>%
dplyr::mutate_at(dplyr::vars(-geometry), as.numeric) %>%
gridExtra::grid.arrange(grobs = lapply(list(
ggplot(aes(fill = ncases), data = .),
ggplot(aes(fill = nhabs), data = .),
ggplot(aes(fill = nclinics), data = .),
ggplot(aes(fill = pop), data = .)
), function(x) {
x +
geom_sf() +
scale_fill_viridis_c(guide = guide_colourbar(
title.position = "top",
title.hjust = .5,
barwidth = 10,
barheight = .5
)) +
theme_void() +
theme(legend.position = "bottom")
}), ncol = 2)
data_space %>%
leaflet::leaflet(width = "100%") %>%
leaflet::addTiles() %>%
leaflet::addPolygons(
weight = 1,
opacity = 1,
fillOpacity = 0.6,
smoothFactor = 0.5,
fillColor = ~leaflet::colorNumeric("Reds", caseden)(caseden),
label = ~lapply(label, htmltools::HTML),
popup = ~lapply(label, htmltools::HTML)
) %>%
leaflet::addCircleMarkers(
data = data$dengue_points,
color = "red",
radius = 5,
fillOpacity = 0.5,
clusterOptions = leaflet::markerClusterOptions()
) %>%
leaflet::addLabelOnlyMarkers(
data = sf::st_centroid(data_space),
label = ~plan_area,
labelOptions = leaflet::labelOptions(
noHide = T,
textOnly = T,
direction = "center",
style = list("color" = "blue"))
)
Let’s plot the x variables (Mean_rainfall, Med_temp, Med_temp_rng) and Y (Dengue)
{par(mfrow = c(3, 2))
hist(data_time$Dengue, breaks=40)
hist(data_time$Mean_rainfall, breaks=40)
hist(data_time$Mean_temp, breaks=40)
hist(data_time$Med_temp, breaks=40)
hist(data_time$Mean_temp_rng, breaks=40)
hist(data_time$Med_temp_rng, breaks=40)
}
ks.test(data_time$Mean_temp, data_time$Med_temp)
Two-sample Kolmogorov-Smirnov test
data: data_time$Mean_temp and data_time$Med_temp
D = 0, p-value = 1
alternative hypothesis: two-sided
ks.test(data_time$Mean_temp_rng, data_time$Med_temp_rng)
Two-sample Kolmogorov-Smirnov test
data: data_time$Mean_temp_rng and data_time$Med_temp_rng
D = 0.065315, p-value = 0.2999
alternative hypothesis: two-sided
Mean_temp and Med_temp have similar distributions, and Mean_temp_rng and Med_temp_rng have similar distributions, therefore Med_temp and Med_temp_rng were used.
Let’s run a correlation model to see the relationship between variables
data_time %>%
dplyr::select(Dengue, Mean_rainfall, Med_temp, Med_temp_rng) %>%
as.matrix() %>%
Hmisc::rcorr() %>%
broom::tidy() %>%
# dplyr::filter(p.value < 0.05) %>%
dplyr::mutate(dplyr::across(is.numeric, ~round(., 3))) %>%
dplyr::arrange(estimate)
# A tibble: 6 x 5
column1 column2 estimate n p.value
<chr> <chr> <dbl> <dbl> <dbl>
1 Med_temp Mean_rainfall -0.499 444 0
2 Med_temp_rng Dengue -0.151 444 0.001
3 Mean_rainfall Dengue -0.052 444 0.278
4 Med_temp_rng Med_temp 0.031 444 0.51
5 Med_temp_rng Mean_rainfall 0.084 444 0.077
6 Med_temp Dengue 0.201 444 0
Run a linear regression model
model1_Dengue <- lm(Dengue ~ Mean_rainfall + Med_temp + Med_temp_rng,
data = data_time)
broom::tidy(model1_Dengue) %>%
dplyr::mutate(dplyr::across(is.numeric, ~round(., 3)))
# A tibble: 4 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -1237. 378. -3.27 0.001
2 Mean_rainfall 4.01 2.46 1.63 0.104
3 Med_temp 61.9 13.2 4.70 0
4 Med_temp_rng -43.3 12.1 -3.59 0
Let’s check the skewness of y variable
data_time %>%
tidyr::drop_na() %>%
{ e1071::skewness(.$Dengue) } # [1] 2.033787
[1] 2.033787
Let’s see how the density plots look like after transformation
{
par(mfrow = c(2, 2))
plot(density(data_time$Dengue, na.rm = T), main = "untransformed")
plot(density(sqrt(data_time$Dengue), na.rm = T), main = "sqrt")
plot(density(log10(data_time$Dengue), na.rm = T), main = "log10")
plot(density(1 / data_time$Dengue, na.rm = T), main = "inverse")
}
Not really good. Try them on Model 1
model1_log_Dengue <- lm(log10(Dengue)~Mean_rainfall+Med_temp+Med_temp_rng,
data=data_time)
broom::tidy(model1_log_Dengue)
# A tibble: 4 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -0.309 0.700 -0.441 0.659
2 Mean_rainfall 0.00294 0.00456 0.644 0.520
3 Med_temp 0.0993 0.0244 4.06 0.0000572
4 Med_temp_rng -0.0407 0.0224 -1.82 0.0695
gvlma::gvlma(model1_log_Dengue)
Call:
lm(formula = log10(Dengue) ~ Mean_rainfall + Med_temp + Med_temp_rng,
data = data_time)
Coefficients:
(Intercept) Mean_rainfall Med_temp Med_temp_rng
-0.309087 0.002939 0.099330 -0.040711
ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance = 0.05
Call:
gvlma::gvlma(x = model1_log_Dengue)
Value p-value Decision
Global Stat 28.1144 1.182e-05 Assumptions NOT satisfied!
Skewness 1.9927 1.581e-01 Assumptions acceptable.
Kurtosis 17.8056 2.447e-05 Assumptions NOT satisfied!
Link Function 0.1611 6.882e-01 Assumptions acceptable.
Heteroscedasticity 8.1550 4.294e-03 Assumptions NOT satisfied!
model1_sqrt_Dengue <- lm(sqrt(Dengue)~Mean_rainfall+Med_temp+Med_temp_rng,
data=data_time)
broom::tidy(model1_sqrt_Dengue)
# A tibble: 4 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -28.7 11.2 -2.57 0.0105
2 Mean_rainfall 0.0782 0.0727 1.08 0.283
3 Med_temp 1.74 0.390 4.47 0.00000986
4 Med_temp_rng -0.974 0.357 -2.73 0.00655
gvlma::gvlma(model1_sqrt_Dengue)
Call:
lm(formula = sqrt(Dengue) ~ Mean_rainfall + Med_temp + Med_temp_rng,
data = data_time)
Coefficients:
(Intercept) Mean_rainfall Med_temp Med_temp_rng
-28.71254 0.07821 1.74274 -0.97433
ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance = 0.05
Call:
gvlma::gvlma(x = model1_sqrt_Dengue)
Value p-value Decision
Global Stat 32.7985 1.313e-06 Assumptions NOT satisfied!
Skewness 22.7656 1.830e-06 Assumptions NOT satisfied!
Kurtosis 0.2636 6.077e-01 Assumptions acceptable.
Link Function 0.5967 4.398e-01 Assumptions acceptable.
Heteroscedasticity 9.1726 2.457e-03 Assumptions NOT satisfied!
model1_inv_Dengue <- lm((1/Dengue)~Mean_rainfall+Med_temp+Med_temp_rng,
data=data_time)
broom::tidy(model1_inv_Dengue)
# A tibble: 4 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 0.0529 0.0141 3.76 0.000193
2 Mean_rainfall -0.0000488 0.0000916 -0.533 0.594
3 Med_temp -0.00164 0.000491 -3.34 0.000896
4 Med_temp_rng 0.000310 0.000449 0.689 0.491
gvlma::gvlma(model1_inv_Dengue)
Call:
lm(formula = (1/Dengue) ~ Mean_rainfall + Med_temp + Med_temp_rng,
data = data_time)
Coefficients:
(Intercept) Mean_rainfall Med_temp Med_temp_rng
0.0528561 -0.0000488 -0.0016408 0.0003096
ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance = 0.05
Call:
gvlma::gvlma(x = model1_inv_Dengue)
Value p-value Decision
Global Stat 200.447 0.000e+00 Assumptions NOT satisfied!
Skewness 133.155 0.000e+00 Assumptions NOT satisfied!
Kurtosis 41.572 1.136e-10 Assumptions NOT satisfied!
Link Function 1.142 2.853e-01 Assumptions acceptable.
Heteroscedasticity 24.578 7.135e-07 Assumptions NOT satisfied!
Let’s have a look at the models’ adjusted R2 values
list(
"untransformed" = model1_Dengue,
"log" = model1_log_Dengue,
"sqrt" = model1_sqrt_Dengue,
"inv" = model1_inv_Dengue
) %>%
lapply(broom::glance) %>%
{ dplyr::bind_cols(names(.), dplyr::bind_rows(.)) } %>%
dplyr::rename(name = ...1)
New names:
* NA -> ...1
# A tibble: 4 x 13
name r.squared adj.r.squared sigma statistic p.value df logLik AIC
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 untr~ 0.0709 0.0646 2.02e+2 11.2 4.34e-7 3 -2985. 5981.
2 log 0.0472 0.0407 3.75e-1 7.26 9.19e-5 3 -193. 395.
3 sqrt 0.0599 0.0535 5.98e+0 9.34 5.34e-6 3 -1422. 2854.
4 inv 0.0292 0.0226 7.53e-3 4.41 4.53e-3 3 1543. -3075.
# ... with 4 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>,
# nobs <int>
It seems that model1_Dengue (untransformed) has a higher adjusted R2 value than the other models, and it explains about 6% of the variance.
Plot the regression models
data_time %>%
dplyr::select(Dengue, Med_temp, Med_temp_rng, Mean_rainfall) %>%
# scale() %>%
tibble::as_tibble() %>%
tidyr::pivot_longer(-Dengue) %>%
ggplot(aes(x = value, y = Dengue, color = name)) +
geom_point(alpha = 0.4) +
geom_smooth(method = "lm", formula = y ~ x, color = "blue") +
facet_grid(. ~ name, scales = "free") +
theme(legend.position = "none")
24 hr temperature mean per month is 27.5 mean Rainfall is 13.8
Since rainfall has marginal significance at an alpha of 10% and known to have effect on vector born diseases, we decided to explore rainfall variable by categorizing it (<11mm of rain fall and >=11mm of rainfall)
data_time$Mean_rainfall %>%
hist()
data_time <- data_time %>%
dplyr::mutate(Rain_11 = ifelse(Mean_rainfall >= 11, 1, 0))
# glimpse(data_time)
Re-Run Model1 With categorised Rain variable
model2_Dengue <- lm(Dengue~Rain_11+Med_temp+Med_temp_rng,
data=data_time)
gvlma::gvlma(model2_Dengue)
Call:
lm(formula = Dengue ~ Rain_11 + Med_temp + Med_temp_rng, data = data_time)
Coefficients:
(Intercept) Rain_11 Med_temp Med_temp_rng
-1219.75 62.59 61.56 -42.21
ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance = 0.05
Call:
gvlma::gvlma(x = model2_Dengue)
Value p-value Decision
Global Stat 803.970 0.000e+00 Assumptions NOT satisfied!
Skewness 218.166 0.000e+00 Assumptions NOT satisfied!
Kurtosis 556.830 0.000e+00 Assumptions NOT satisfied!
Link Function 2.482 1.151e-01 Assumptions acceptable.
Heteroscedasticity 26.493 2.645e-07 Assumptions NOT satisfied!
broom::tidy(model2_Dengue)
# A tibble: 4 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -1220. 351. -3.48 0.000556
2 Rain_11 62.6 28.1 2.22 0.0266
3 Med_temp 61.6 12.3 5.01 0.000000805
4 Med_temp_rng -42.2 12.0 -3.53 0.000463
Add Interaction term
model2_Dengue_Int <- lm(Dengue ~ (Med_temp + Med_temp_rng) * Rain_11,
data=data_time)
gvlma::gvlma(model2_Dengue_Int)
Call:
lm(formula = Dengue ~ (Med_temp + Med_temp_rng) * Rain_11, data = data_time)
Coefficients:
(Intercept) Med_temp Med_temp_rng
-1187.07 56.90 -26.90
Rain_11 Med_temp:Rain_11 Med_temp_rng:Rain_11
-1861.85 97.09 -111.60
ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance = 0.05
Call:
gvlma::gvlma(x = model2_Dengue_Int)
Value p-value Decision
Global Stat 513.22 0.000e+00 Assumptions NOT satisfied!
Skewness 166.77 0.000e+00 Assumptions NOT satisfied!
Kurtosis 310.38 0.000e+00 Assumptions NOT satisfied!
Link Function 16.41 5.092e-05 Assumptions NOT satisfied!
Heteroscedasticity 19.65 9.293e-06 Assumptions NOT satisfied!
broom::tidy(model2_Dengue_Int)
# A tibble: 6 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -1187. 382. -3.10 0.00204
2 Med_temp 56.9 13.1 4.34 0.0000179
3 Med_temp_rng -26.9 13.3 -2.02 0.0435
4 Rain_11 -1862. 1006. -1.85 0.0649
5 Med_temp:Rain_11 97.1 39.4 2.46 0.0142
6 Med_temp_rng:Rain_11 -112. 32.7 -3.41 0.000702
anova(model2_Dengue,model2_Dengue_Int)
Analysis of Variance Table
Model 1: Dengue ~ Rain_11 + Med_temp + Med_temp_rng
Model 2: Dengue ~ (Med_temp + Med_temp_rng) * Rain_11
Res.Df RSS Df Sum of Sq F Pr(>F)
1 440 17909105
2 438 17396320 2 512786 6.4554 0.001726 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
broom::glance(model2_Dengue)
# A tibble: 1 x 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.0757 0.0694 202. 12.0 1.44e-7 3 -2984. 5979. 5999.
# ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
broom::glance(model2_Dengue_Int)
# A tibble: 1 x 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.102 0.0919 199. 9.97 4.82e-9 5 -2978. 5970. 5998.
# ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
There is an interaction effect (Rain_11 on 2 other predictors) and Anova test shows adding the interaction terms improves the model fit (p value=0.0017). This interaction model has a higher R2 value too.
r.squared(Model2_Dengue) –> 0.0757 r.squared(Model2_Dengue_Int) –> 0.102 adj.r.squared(Model2_Dengue) –> 0.0694 adj.r.squared(Model2_Dengue_Int) –> 0.0919
Unfortunately, all model assumptions failed. So This model won’t be a good one for prediction
Lets explore this interaction model
#Report the model summary and significance using export_summs() function
library(jtools)
jtools::export_summs(model2_Dengue,model2_Dengue_Int,
error_format = "(p = {p.value})",
model.names = c("Main Effects Only",
"Interaction Effects"),
digits = 3)
| Main Effects Only | Interaction Effects | |
|---|---|---|
| (Intercept) | -1219.750 *** | -1187.070 ** |
| (p = 0.001) | (p = 0.002) | |
| Rain_11 | 62.587 * | -1861.847 |
| (p = 0.027) | (p = 0.065) | |
| Med_temp | 61.565 *** | 56.900 *** |
| (p = 0.000) | (p = 0.000) | |
| Med_temp_rng | -42.213 *** | -26.895 * |
| (p = 0.000) | (p = 0.044) | |
| Med_temp:Rain_11 | 97.086 * | |
| (p = 0.014) | ||
| Med_temp_rng:Rain_11 | -111.602 *** | |
| (p = 0.001) | ||
| N | 444 | 444 |
| R2 | 0.076 | 0.102 |
| *** p < 0.001; ** p < 0.01; * p < 0.05. | ||
Probing Interaction Effects Run Spotlight analysis, using Johnson-Neyman Techniques
# a. Median Temperature
interactions::sim_slopes(model2_Dengue_Int,
pred = Rain_11,
modx = Med_temp_rng,
johnson_neyman = T) #7.01
JOHNSON-NEYMAN INTERVAL
When Med_temp_rng is OUTSIDE the interval [7.01, 9.04], the slope of
Rain_11 is p < .05.
Note: The range of observed values of Med_temp_rng is [3.50, 8.50]
SIMPLE SLOPES ANALYSIS
Slope of Rain_11 when Med_temp_rng = 5.63 (- 1 SD):
Est. S.E. t val. p
-------- ------- -------- ------
223.99 53.48 4.19 0.00
Slope of Rain_11 when Med_temp_rng = 6.43 (Mean):
Est. S.E. t val. p
-------- ------- -------- ------
134.46 38.04 3.53 0.00
Slope of Rain_11 when Med_temp_rng = 7.24 (+ 1 SD):
Est. S.E. t val. p
------- ------- -------- ------
44.92 37.56 1.20 0.23
interactions::interact_plot(model2_Dengue_Int,
pred="Med_temp",
modx = "Rain_11",
modx.labels = c("less than 11mm of rain fall",
"rain fall >=11mm"),
interval = T,
int.width = .95,
colors = c("deeppink",
"darkgreen"),
vary.lty = T,
line.thickness = 1,
legend.main = "") +
ggplot2:: geom_vline(xintercept = 27.15, col = "red", linetype = 3, size = 1)+
labs(title = "Effect of Rain fall and Temperature on Dengue cases in Singapore",
subtitle = "With an average weekly rainfall of 11mm and above,the number of Dengue cases\nincreases with rise in temperature",
x = "Weekly Median Temperature",
y = "Number of Dengue Cases",
caption = "Source: MOH, NEA Websites") +
annotate("text",
x = 28,
y = 0,
label = "The shaded areas denote 95% confidence intervals.\nThe vertical line marks the boundary\nbetween regions of significance and non-significance\nbased on alpha at 5%") +
theme(legend.position = "top")
# b. Median Temperature Range
interactions::sim_slopes(model2_Dengue_Int,
pred = Rain_11,
modx = Med_temp,
johnson_neyman = T) #27.15
JOHNSON-NEYMAN INTERVAL
When Med_temp is OUTSIDE the interval [23.43, 27.15], the slope of Rain_11
is p < .05.
Note: The range of observed values of Med_temp is [25.06, 29.96]
SIMPLE SLOPES ANALYSIS
Slope of Rain_11 when Med_temp = 27.11 (- 1 SD):
Est. S.E. t val. p
------- ------- -------- ------
52.49 28.96 1.81 0.07
Slope of Rain_11 when Med_temp = 27.96 (Mean):
Est. S.E. t val. p
-------- ------- -------- ------
134.46 38.04 3.53 0.00
Slope of Rain_11 when Med_temp = 28.80 (+ 1 SD):
Est. S.E. t val. p
-------- ------- -------- ------
216.42 65.35 3.31 0.00
interactions::interact_plot(model2_Dengue_Int,
pred="Med_temp_rng",
modx = "Rain_11",
modx.labels = c("less than 11mm of rain fall",
"rain fall >=11mm"),
interval = T,
int.width = .95,
colors = c("deeppink",
"darkgreen"),
vary.lty = T,
line.thickness = 1,
legend.main = "Rain Fall")+
ggplot2:: geom_vline(xintercept = 7.01, col = "red", linetype = 3, size = 1)+
labs(title = "Effect of Rain fall and Temperature Variation on Dengue Cases in Singapore",
subtitle = "With an average weekly rainfall of 11mm and above,the number of Dengue cases\ndecreases when tempeature variation is HIGH",
x = "Weekly Median Temperature Range",
y = "Number of Dengue Cases",
caption = "Source: MOH, NEA Websites") +
annotate("text",
x = 6,
y = 100,
label = "The shaded areas denote 95% confidence intervals.\nThe vertical line marks the boundary\nbetween regions of significance and non-significance\nbased on alpha at 5%") +
theme(legend.position = "top")
Since the number of URTI cases in Singapore drastically reduced due to COVID-19, we decided to exclude the values since April 2020 (from circuit breaker period), for this analysis
Let’s plot the x variables (Mean_rainfall,Med_temp,Med_temp_rng) and Y (URTI)
data_time_URTI <- data_time %>%
dplyr::filter(URTI >= 2000)
{
par(mfrow = c(2, 2))
hist(data_time_URTI$Mean_rainfall, breaks=40)
hist(data_time_URTI$Med_temp, breaks=40)
hist(data_time_URTI$Med_temp_rng, breaks=40)
hist(data_time_URTI$URTI, breaks=40)
}
Let’s run a correlation model to see the relationship between variables
data_time_URTI %>%
dplyr::select(URTI, Mean_rainfall, Med_temp, Med_temp_rng) %>%
as.matrix() %>%
Hmisc::rcorr() %>%
broom::tidy() %>%
# dplyr::filter(p.value < 0.05) %>%
dplyr::mutate(dplyr::across(is.numeric, ~round(., 3))) %>%
dplyr::arrange(estimate)
# A tibble: 6 x 5
column1 column2 estimate n p.value
<chr> <chr> <dbl> <dbl> <dbl>
1 Med_temp Mean_rainfall -0.508 429 0
2 Med_temp URTI -0.195 429 0
3 Med_temp_rng URTI -0.086 429 0.075
4 Mean_rainfall URTI -0.042 429 0.383
5 Med_temp_rng Med_temp 0.027 429 0.577
6 Med_temp_rng Mean_rainfall 0.091 429 0.06
Run a simple linear regression model
model1_URTI <- lm(URTI~Mean_rainfall+Med_temp+Med_temp_rng,
data=data_time_URTI)
broom::tidy(model1_URTI)
# A tibble: 4 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 6804. 729. 9.34 5.59e-19
2 Mean_rainfall -15.7 4.74 -3.31 1.01e- 3
3 Med_temp -133. 25.5 -5.22 2.75e- 7
4 Med_temp_rng -30.5 23.2 -1.31 1.91e- 1
gvlma::gvlma(model1_URTI)
Call:
lm(formula = URTI ~ Mean_rainfall + Med_temp + Med_temp_rng,
data = data_time_URTI)
Coefficients:
(Intercept) Mean_rainfall Med_temp Med_temp_rng
6804.45 -15.68 -132.97 -30.45
ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance = 0.05
Call:
gvlma::gvlma(x = model1_URTI)
Value p-value Decision
Global Stat 119.874 0.000e+00 Assumptions NOT satisfied!
Skewness 48.234 3.782e-12 Assumptions NOT satisfied!
Kurtosis 48.947 2.630e-12 Assumptions NOT satisfied!
Link Function 3.667 5.550e-02 Assumptions acceptable.
Heteroscedasticity 19.026 1.290e-05 Assumptions NOT satisfied!
Assumption checks failed Let’s check the skewness of y variable
e1071::skewness(data_time_URTI$URTI) #[1] 0.8635183
[1] 0.8634216
#Lets see how the density plots look like after transformation
{
par(mfrow = c(2, 2))
plot(density(data_time_URTI$URTI), main = "untransformed")
plot(density(sqrt(data_time_URTI$URTI)), main = "sqrt")
plot(density(log10(data_time_URTI$URTI)), main = "log10")
plot(density(1 / data_time_URTI$URTI), main = "inverse")
}
Try them on Model 1
model1_log_URTI <- lm(log10(URTI)~Mean_rainfall+Med_temp+Med_temp_rng,
data=data_time_URTI)
broom::tidy(model1_log_URTI)
# A tibble: 4 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 4.03 0.110 36.8 3.38e-134
2 Mean_rainfall -0.00232 0.000713 -3.26 1.22e- 3
3 Med_temp -0.0195 0.00383 -5.09 5.50e- 7
4 Med_temp_rng -0.00523 0.00350 -1.50 1.35e- 1
gvlma::gvlma(model1_log_URTI)
Call:
lm(formula = log10(URTI) ~ Mean_rainfall + Med_temp + Med_temp_rng,
data = data_time_URTI)
Coefficients:
(Intercept) Mean_rainfall Med_temp Med_temp_rng
4.034784 -0.002320 -0.019479 -0.005233
ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance = 0.05
Call:
gvlma::gvlma(x = model1_log_URTI)
Value p-value Decision
Global Stat 22.357 0.0001701 Assumptions NOT satisfied!
Skewness 8.284 0.0039991 Assumptions NOT satisfied!
Kurtosis 1.968 0.1606558 Assumptions acceptable.
Link Function 3.159 0.0755032 Assumptions acceptable.
Heteroscedasticity 8.946 0.0027807 Assumptions NOT satisfied!
model1_sqrt_URTI <- lm(sqrt(URTI)~Mean_rainfall+Med_temp+Med_temp_rng,
data=data_time_URTI)
broom::tidy(model1_sqrt_URTI)
# A tibble: 4 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 89.6 6.76 13.3 7.25e-34
2 Mean_rainfall -0.144 0.0439 -3.29 1.09e- 3
3 Med_temp -1.22 0.236 -5.16 3.73e- 7
4 Med_temp_rng -0.303 0.215 -1.41 1.60e- 1
gvlma::gvlma(model1_sqrt_URTI)
Call:
lm(formula = sqrt(URTI) ~ Mean_rainfall + Med_temp + Med_temp_rng,
data = data_time_URTI)
Coefficients:
(Intercept) Mean_rainfall Med_temp Med_temp_rng
89.6374 -0.1444 -1.2182 -0.3032
ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance = 0.05
Call:
gvlma::gvlma(x = model1_sqrt_URTI)
Value p-value Decision
Global Stat 52.462 1.104e-10 Assumptions NOT satisfied!
Skewness 23.091 1.545e-06 Assumptions NOT satisfied!
Kurtosis 12.591 3.877e-04 Assumptions NOT satisfied!
Link Function 3.415 6.462e-02 Assumptions acceptable.
Heteroscedasticity 13.365 2.563e-04 Assumptions NOT satisfied!
model1_inv_URTI <- lm((1/URTI)~Mean_rainfall+Med_temp+Med_temp_rng,
data=data_time_URTI)
broom::tidy(model1_inv_URTI)
# A tibble: 4 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -0.000108 0.0000902 -1.20 0.231
2 Mean_rainfall 0.00000185 0.000000586 3.16 0.00168
3 Med_temp 0.0000154 0.00000315 4.89 0.00000143
4 Med_temp_rng 0.00000475 0.00000288 1.65 0.0997
gvlma::gvlma(model1_inv_URTI)
Call:
lm(formula = (1/URTI) ~ Mean_rainfall + Med_temp + Med_temp_rng,
data = data_time_URTI)
Coefficients:
(Intercept) Mean_rainfall Med_temp Med_temp_rng
-1.083e-04 1.854e-06 1.541e-05 4.745e-06
ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance = 0.05
Call:
gvlma::gvlma(x = model1_inv_URTI)
Value p-value Decision
Global Stat 6.13667 0.18917 Assumptions acceptable.
Skewness 0.31090 0.57713 Assumptions acceptable.
Kurtosis 0.03011 0.86224 Assumptions acceptable.
Link Function 2.65447 0.10326 Assumptions acceptable.
Heteroscedasticity 3.14119 0.07634 Assumptions acceptable.
Assumptions acceptable
Plot the regression model 1
data_time_URTI %>%
dplyr::select(URTI, Med_temp, Med_temp_rng,Mean_rainfall) %>%
# scale() %>%
tibble::as_tibble() %>%
tidyr::pivot_longer(-URTI) %>%
ggplot(aes(x = value, y = URTI, color = name)) +
geom_point(alpha = 0.4) +
geom_smooth(method = "lm", formula = y ~ x, color = "blue") +
facet_grid(. ~ name, scales = "free") +
theme(legend.position = "none")
From the plot, there are more cases when rain fall is lesser. So we decided to explore rainfall variable by categorizing it. (<11cm of rain fall and >=11cm of rainfall))
data_time_URTI <- data_time_URTI %>%
dplyr::mutate(Rain_11 = ifelse(Mean_rainfall >= 11, 1, 0))
# glimpse(data_time_URTI)
Re-Run Model1 With categorised Rain variable
model2_inv_URTI<- lm((1/URTI)~Rain_11+Med_temp+Med_temp_rng,
data=data_time_URTI)
gvlma::gvlma(model2_inv_URTI)
Call:
lm(formula = (1/URTI) ~ Rain_11 + Med_temp + Med_temp_rng, data = data_time_URTI)
Coefficients:
(Intercept) Rain_11 Med_temp Med_temp_rng
-3.362e-05 1.536e-05 1.290e-05 5.447e-06
ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance = 0.05
Call:
gvlma::gvlma(x = model2_inv_URTI)
Value p-value Decision
Global Stat 7.11459 0.1300 Assumptions acceptable.
Skewness 0.06411 0.8001 Assumptions acceptable.
Kurtosis 0.01816 0.8928 Assumptions acceptable.
Link Function 3.41266 0.0647 Assumptions acceptable.
Heteroscedasticity 3.61966 0.0571 Assumptions acceptable.
broom::tidy(model2_inv_URTI)
# A tibble: 4 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -0.0000336 0.0000842 -0.399 0.690
2 Rain_11 0.0000154 0.00000683 2.25 0.0251
3 Med_temp 0.0000129 0.00000296 4.36 0.0000162
4 Med_temp_rng 0.00000545 0.00000288 1.89 0.0589
Add Interaction term
model2_inv_URTI_Int <- lm((1/URTI)~(Med_temp+Med_temp_rng)*Rain_11,
data=data_time_URTI)
gvlma::gvlma(model2_inv_URTI_Int)
Call:
lm(formula = (1/URTI) ~ (Med_temp + Med_temp_rng) * Rain_11,
data = data_time_URTI)
Coefficients:
(Intercept) Med_temp Med_temp_rng
-1.161e-04 1.590e-05 5.151e-06
Rain_11 Med_temp:Rain_11 Med_temp_rng:Rain_11
8.132e-04 -3.302e-05 1.568e-05
ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance = 0.05
Call:
gvlma::gvlma(x = model2_inv_URTI_Int)
Value p-value Decision
Global Stat 4.8987 0.29786 Assumptions acceptable.
Skewness 0.1427 0.70559 Assumptions acceptable.
Kurtosis 0.3799 0.53765 Assumptions acceptable.
Link Function 0.7336 0.39173 Assumptions acceptable.
Heteroscedasticity 3.6424 0.05632 Assumptions acceptable.
broom::tidy(model2_inv_URTI_Int)
# A tibble: 6 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -0.000116 0.0000917 -1.27 0.206
2 Med_temp 0.0000159 0.00000315 5.05 0.000000645
3 Med_temp_rng 0.00000515 0.00000318 1.62 0.106
4 Rain_11 0.000813 0.000254 3.21 0.00144
5 Med_temp:Rain_11 -0.0000330 0.0000101 -3.27 0.00116
6 Med_temp_rng:Rain_11 0.0000157 0.00000837 1.87 0.0617
anova(model2_inv_URTI,model2_inv_URTI_Int)
Analysis of Variance Table
Model 1: (1/URTI) ~ Rain_11 + Med_temp + Med_temp_rng
Model 2: (1/URTI) ~ (Med_temp + Med_temp_rng) * Rain_11
Res.Df RSS Df Sum of Sq F Pr(>F)
1 425 9.6001e-07
2 423 9.3611e-07 2 2.3896e-08 5.3988 0.004839 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
broom::glance(model2_inv_URTI)
# A tibble: 1 x 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.0534 0.0467 4.75e-5 7.99 3.43e-5 3 3664. -7317. -7297.
# ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
broom::glance(model2_inv_URTI_Int)
# A tibble: 1 x 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.0769 0.0660 4.70e-5 7.05 2.40e-6 5 3669. -7324. -7296.
# ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
Median temperature is significant at an alpha level of 5%. There is an interaction effect too (Rain_11 on Med_temp) and Anova test shows adding the interaction terms improves the model fit (p value = 0.004838). This interaction model has a higher R2 value too.
r.squared(model2_inv_URTI) –> 0.053 r.squared(model2_inv_URTI_Int)–> 0.077 adj.r.squared(model2_inv_URTI) –> 0.0467 adj.r.squared(model2_inv_URTI_Int)–> 0.0660
Report the model summary and significance using export_summs() function
jtools::export_summs(model2_inv_URTI,model2_inv_URTI_Int,
error_format = "(p = {p.value})",
model.names = c("Main Effects Only",
"Interaction Effects"),
digits = 3)
| Main Effects Only | Interaction Effects | |
|---|---|---|
| (Intercept) | -0.000 | -0.000 |
| (p = 0.690) | (p = 0.206) | |
| Rain_11 | 0.000 * | 0.001 ** |
| (p = 0.025) | (p = 0.001) | |
| Med_temp | 0.000 *** | 0.000 *** |
| (p = 0.000) | (p = 0.000) | |
| Med_temp_rng | 0.000 | 0.000 |
| (p = 0.059) | (p = 0.106) | |
| Med_temp:Rain_11 | -0.000 ** | |
| (p = 0.001) | ||
| Med_temp_rng:Rain_11 | 0.000 | |
| (p = 0.062) | ||
| N | 429 | 429 |
| R2 | 0.053 | 0.077 |
| *** p < 0.001; ** p < 0.01; * p < 0.05. | ||
Probing Interaction Effects
#Run Spotlight analysis, using Johnson-Neyman Techniques
# Median Temperature
# interactions::sim_slopes(model2_inv_URTI_Int,
# pred = Rain_11,
# modx = Med_temp,
# johnson_neyman = T) ##[27.28, 28.61]
#
# #Run interaction_plot() by adding benchmark for regions of significance
# # Use plot B to see the actual URTI numbers instead of y inverse
# # a. Median Temperature
# interactions::interact_plot(model2_inv_URTI_Int,
# pred="Med_temp",
# modx = "Rain_11",
# modx.labels = c("less than 11cm of rain fall",
# "rain fall >=11cm"),
# interval = T,
# int.width = .95,
# colors = c("deeppink",
# "darkgreen"),
# vary.lty = T,
# line.thickness = 1,
# legend.main = "") +
# ggplot2:: geom_vline(xintercept = 27.28, col = "red", linetype = 3, size = 1)+
# labs(title = "Effect of Rain fall and Temperature on rise in URTI Cases in Singapore",
# subtitle = "For lower weekly rainfall, the number of URTI cases\ndereases with rise in temperature",
# x = "Weekly Median Temperature",
# y = "1/Number of URTI Cases",
# caption = "Source: MOH, NEA websites") +
# annotate("text",
# x = 28,
# y = 0.0003,
# label = "The shaded areas denote 95% confidence intervals.\nThe vertical line marks the boundary\nbetween regions of significance and non-significance\nbased on alpha at 5%") +
# theme(legend.position = "top")
##############################################
#This model model2_URTI is just for plotting purpose.
model2_URTI_Int <- lm(URTI~(Med_temp+Med_temp_rng)*Rain_11,
data=data_time_URTI)
#Run Spotlight analysis, using Johnson-Neyman Techniques
# Median Temperature
# sim_slopes(model2_URTI_Int,
# pred = Rain_11,
# modx = Med_temp,
# johnson_neyman = F)
interactions::sim_slopes(model2_URTI_Int,
pred = Rain_11,
modx = Med_temp,
johnson_neyman = T) #[27.24, 28.48]
JOHNSON-NEYMAN INTERVAL
When Med_temp is OUTSIDE the interval [27.24, 28.48], the slope of Rain_11
is p < .05.
Note: The range of observed values of Med_temp is [25.06, 29.96]
SIMPLE SLOPES ANALYSIS
Slope of Rain_11 when Med_temp = 27.09 (- 1 SD):
Est. S.E. t val. p
--------- ------- -------- ------
-149.99 56.54 -2.65 0.01
Slope of Rain_11 when Med_temp = 27.94 (Mean):
Est. S.E. t val. p
------- ------- -------- ------
79.97 79.73 1.00 0.32
Slope of Rain_11 when Med_temp = 28.78 (+ 1 SD):
Est. S.E. t val. p
-------- -------- -------- ------
309.93 137.96 2.25 0.03
interactions::interact_plot(model2_URTI_Int,
pred="Med_temp",
modx = "Rain_11",
modx.labels = c("less than 11cm of rain fall",
"rain fall >=11cm"),
interval = T,
int.width = .95,
colors = c("deeppink",
"darkgreen"),
vary.lty = T,
line.thickness = 1,
legend.main = "") +
ggplot2::geom_vline(xintercept = 27.24, col = "red", linetype = 3, size = 1)+
ggplot2::geom_vline(xintercept = 28.48, col = "red", linetype = 3, size = 1)+
labs(title = "Effect of Rain fall and Temperature on rise in URTI Cases in Singapore",
subtitle = "For lower weekly rainfall, the number of URTI cases\ndereases with rise in temperature",
x = "Weekly Median Temperature",
y = "Number of URTI Cases",
caption = "Source: MOH, NEA websites") +
annotate("text",
x = 28,
y = 2000,
label = "The shaded areas denote 95% confidence intervals.\nThe vertical line marks the boundary\nbetween regions of significance and non-significance\nbased on alpha at 5%") +
theme(legend.position = "top")
The regression analysis came up with two significant interaction terms.
First, it appears that the relationships between median temperature and number of cases of dengue is different depending on one’s gender.
Second, it appears that the relationships between social comparisons orientation regarding abilities and life satisfaction is different depending on one’s education.
To see the patterns of interaction, we visualized the significant interaction effects on the next chapter.
From our data science project, we could find the following two findings:
The relationships between the tendency of people to compare themselves to others’ abilities and life satisfaction differ depending on one’s gender. Specifically, there appears to be no relationship between social comparisons orientation and life satisfaction among males. On the other hand, among females, the more they compare their abilities with others, there seems to be lesser life satisfaction. Thus, social comparison seems to be harmful for life satisfaction among females, whereas there seems to be no relationships between social comparisons and life satisfaction among males. (You might want to highlight that the relationships between social comparisons orientation and life satisfaction is based on observational study, leading to correlations, not causations, in Limitations and Future Directions section below).
The relationships between social comparison and life satisfaction also depends on one’s education level. It appears that, among those who received average and high levels of education, the greater the social comparison tendency, the lower the life satisfaction. Such a negative relationship between social comparison and life satisfaction was not found among those with relatively lower levels of education. Social comparison regarding one’s abilities can hurt one’s life satisfaction, when one receives average and above-average levels of education (again, acknowledge that the findings are correlational though, thus further investigation with a/b testing should follow).
Prof. Roh’s Note: “This is where you provide the significance of the findings. Unlike the other sections, where your goal is to describe the results that you found (what the data told you). This is where you chime in and proactively discuss the meaning of the results.”
Prof. Roh’s Note: “Acknowledging limitations is not where you just provide a laundry list of what is missing and what should have done. Please take the responsibility of your analyses and inform your readers to understand what the results tell and don’t (or can’t) tell. More importantly, this is the section where you technically begin your next data science project, by highlighting what would be informative down the line to shed further light on what you have found here.”
References
Appendix
Prof. Roh’s Note: “Please describe your individual contribution to the team’s project (in detail).”